[serve][llm][transcription] Add support for Transcription in vLLM engine backend #57194
kouroshHakha merged 41 commits into ray-project:master from […]
Conversation
Code Review
This pull request introduces a new transcription API, following the OpenAI specification. The changes are well-structured, touching the necessary model definitions, LLM server, vLLM engine, and router components. The implementation largely follows existing patterns in the codebase. However, I've identified a couple of critical issues that would cause runtime errors, such as a missing comma in a type hint and a method name mismatch between the server and the engine. There are also some minor maintainability issues like a copy-pasted comment and a typo in a docstring. Addressing these points will make the PR ready for merging.
Resolved review thread (outdated) on python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py
Pull Request Overview
This PR introduces support for a transcription API to vLLM's OpenAI-compatible interface, following the OpenAI audio/transcriptions API specification. The implementation adds the necessary request/response models, router endpoints, and engine integration to handle audio transcription requests.
- Adds TranscriptionRequest, TranscriptionResponse, and TranscriptionStreamResponse models
- Implements the /v1/audio/transcriptions endpoint in the router
- Integrates transcription support into the vLLM engine with proper error handling
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| python/ray/serve/llm/openai_api_models.py | Adds public API models for transcription request/response types |
| python/ray/llm/_internal/serve/deployments/routers/router.py | Implements transcription endpoint and updates request processing logic |
| python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py | Adds transcription engine integration with vLLM OpenAI serving |
| python/ray/llm/_internal/serve/deployments/llm/llm_server.py | Adds transcription method to LLM server with async generator interface |
| python/ray/llm/_internal/serve/configs/openai_api_models.py | Defines internal transcription models and response type unions |
kouroshHakha
left a comment
Nice. I think the basic feature looks good. We need to just add CI tests and some release tests as well.
For CI please take a look at existing tests for the endpoints at engine and router levels. Here are some I found:
- https://github.com/ray-project/ray/blob/master/python/ray/llm/tests/serve/cpu/deployments/llm/test_llm_engine.py: includes tests for engine interfaces
- https://github.com/ray-project/ray/blob/master/python/ray/llm/tests/serve/cpu/deployments/llm/test_llm_server.py: includes tests for LLMServer interfaces.
You would need to create a mock engine with some reasonable transcription behavior.
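A minimal sketch of what such a mock could look like, purely as an illustration (the class and method names below are assumptions, not the actual test fixtures):

```python
import asyncio
from typing import Any, AsyncGenerator


class MockTranscriptionEngine:
    """Fake engine that returns a canned transcript instead of running vLLM."""

    async def transcriptions(self, request: Any) -> AsyncGenerator[dict, None]:
        # No real audio decoding here; yield a deterministic payload so tests
        # can assert on the response shape produced by the server layer.
        await asyncio.sleep(0)
        yield {"text": "hello world"}


async def _demo() -> None:
    engine = MockTranscriptionEngine()
    async for response in engine.transcriptions(request=None):
        assert response["text"] == "hello world"


if __name__ == "__main__":
    asyncio.run(_demo())
```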
Let's keep the translation endpoint for another PR, after we cover everything for this new endpoint.
For the release test, could you share the `serve run` script that you used to validate the behavior, along with the client code and expected output? We can turn that into a GPU release test with a real model (maybe whisper-tiny, etc.) so that it is continuously tested.
@kouroshHakha CI tests have been written and docs have also been updated. Please check and verify. If we are going to adopt vllm==0.11.0, then V0 has been entirely deprecated and all models are supported via the V1 engine. We need to make appropriate changes for the docs, etc. (e.g., embeddings).
kouroshHakha
left a comment
Adding @eicherseiji for review.
eicherseiji
left a comment
Recommend `pip install pre-commit && pre-commit install` before a lint commit to satisfy the CI.
For a release test, recommend […] and `ray/release/release_tests.yaml` (line 3781 in 067c02a).
Looks like we're in pretty good shape though. Just a few comments + release test and we should be good. Thanks!
Bug: Batched Response Type Handling Issue. The condition […]

Bug: Streaming Response Type Inconsistency. The type annotations for […]

Bug: The […]
/gemini review
Code Review
This pull request introduces a new transcription API, which is a fantastic addition to the LLM serving capabilities. The implementation is well-structured, with changes spanning from the public API models to the vLLM engine, and includes comprehensive tests. The refactoring in ingress.py to generalize request processing is a notable improvement for maintainability. I have a few suggestions regarding potential memory usage with large audio files, improving the clarity of the documentation example, and fixing a broken link in a TODO comment. Overall, this is a solid contribution.
```python
voxtral_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "voxtral-mini",
        "model_source": "mistralai/Voxtral-Mini-3B-2507",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs={
        "tokenizer_mode": "mistral",
        "config_format": "mistral",
        "load_format": "mistral",
    },
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [whisper_llm_config, voxtral_llm_config]})
```
This example, which is focused on transcriptions, also includes the configuration for a second, unrelated model (voxtral-mini). This could be confusing for users who are looking for a minimal, focused example on how to set up a transcription service.
For better clarity and to make the example easier to copy and adapt, I recommend removing the voxtral_llm_config and updating the app creation on line 80 to only use the whisper_llm_config, like this:
`app = build_openai_app({"llm_configs": [whisper_llm_config]})`

This will make the example more direct and easier to follow for the specific use case of transcriptions.
I agree. Should we just keep voxtral_llm_config, which also shows the engine_kwargs?
```python
raw_request = self._create_raw_request(request, "/audio/transcriptions")

# Extract audio data from the request file
audio_data = await request.file.read()
```
Reading the entire audio file into memory with await request.file.read() could lead to high memory consumption, especially with large audio files (the OpenAI API limit is 25MB) and concurrent requests. This might risk Out-Of-Memory errors on the replica.
If the underlying create_transcription method in vLLM supports it, consider streaming the file content instead of reading it all at once. This could be done by passing a file-like object or an async generator. If vLLM requires the full byte string, this is an acceptable limitation, but it's an important performance consideration to be aware of.
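As a sketch of the streaming alternative, assuming the downstream call can consume an async iterator of bytes (the helper name and chunk size are assumptions, not vLLM's actual API):

```python
from typing import AsyncIterator

from starlette.datastructures import UploadFile


async def iter_audio_chunks(
    upload: UploadFile, chunk_size: int = 1 << 20
) -> AsyncIterator[bytes]:
    """Yield the uploaded audio in 1 MiB chunks instead of a single read()."""
    while True:
        chunk = await upload.read(chunk_size)
        if not chunk:
            break
        yield chunk
```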
kouroshHakha
left a comment
Nice. Looks great. Just a few nits before wrapping up this PR:
| "async-timeout; python_version < '3.11'", | ||
| "typer", | ||
| "meson", | ||
| "pybind11", |
@eicherseiji a chore we should do is to make this part of setup.py read the llm-requirements.txt directly so we just update one source of truth down the line.
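A rough sketch of that chore, assuming a pip-style requirements file (the path and helper name are illustrative, not the actual setup.py code):

```python
from pathlib import Path
from typing import List


def parse_requirements(path: str) -> List[str]:
    """Return requirement specifiers from a requirements.txt-style file."""
    specs = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            specs.append(line)
    return specs


# In setup.py the "llm" extra could then come from the single source of truth:
# extras["llm"] = parse_requirements("python/requirements/llm/llm-requirements.txt")
```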
| """ | ||
| pass | ||
|
|
||
| @abc.abstractmethod |
I think we should make this method not abstract, so the subclasses can skip their implementation. We can have a NotImplementedError here. I think this point applies to all the endpoints actually, but it just came to my mind now :)
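Something along these lines, as a sketch (the class and method names are placeholders, not the actual base class):

```python
from typing import Any, AsyncGenerator


class BaseLLMEngine:
    """Sketch of a base class where endpoints default to NotImplementedError."""

    async def transcriptions(self, request: Any) -> AsyncGenerator[Any, None]:
        # Subclasses that do not serve audio models simply skip overriding this.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement the transcriptions endpoint."
        )
        yield  # unreachable; keeps this defined as an async generator
```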
Making sure release tests pass: https://buildkite.com/ray-project/release/builds/65185
@kouroshHakha made the appropriate changes from the review.
Requested stamp from @richardliaw
@Blaze-DSP the GPU test on the example is failing now (probably due to changing to a different model?). Can you take a look? It has a GPU memory problem.
```python
AsyncGenerator[
    Union[str, ChatCompletionStreamResponse, ChatCompletionResponse, ErrorResponse],
    None,
],
```
Bug: Type Mismatch in Streaming Responses
The LLM*Response type annotations (e.g., LLMChatResponse, LLMCompletionsResponse, LLMTranscriptionResponse) include stream response object types. However, streaming endpoints ultimately yield SSE-formatted strings, not these objects. This creates a mismatch between the declared types and the actual runtime output, which can lead to type checking issues and confusion.
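For illustration, a sketch of an alias that matches what streaming actually yields; the placeholder classes below stand in for the real response models referenced in this comment:

```python
from typing import AsyncGenerator, Union


class ChatCompletionResponse: ...
class ChatCompletionStreamResponse: ...
class ErrorResponse: ...


# Streaming endpoints yield SSE-formatted strings, so `str` belongs in the union.
LLMChatResponse = AsyncGenerator[
    Union[str, ChatCompletionStreamResponse, ChatCompletionResponse, ErrorResponse],
    None,
]
```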
| "not set it, a random_uuid will be generated. This id is used " | ||
| "through out the inference process and return in response." | ||
| ), | ||
| ) |
Bug: Request ID Conflict in Transcription Classes
The TranscriptionRequest class explicitly adds a request_id field. This could conflict with vLLMTranscriptionRequest if the base class already defines it, potentially causing duplication or unexpected behavior. This also differs from other request types that inherit request_id from their vLLM base classes.
```python
set(
    [
        "vllm>=0.11.0",
        "vllm[audio]>=0.11.0",
```
Bug: Unnecessary Audio Dependencies in Ray[LLM]
The ray[llm] extra now unconditionally pulls in vllm[audio], adding audio-related dependencies. This increases the dependency footprint for all ray[llm] users, even those not needing transcription features, making the installation heavier than necessary.
| "async-timeout; python_version < '3.11'", | ||
| "typer", | ||
| "meson", | ||
| "pybind11", |
Bug: Setup.py and Requirements.txt Mismatch
The setup.py adds "meson" and "pybind11" to the llm extras, but these dependencies are not present in the corresponding llm-requirements.txt file (lines 14-15 in the requirements file). While the comment on line 372 states "Keep this in sync with python/requirements/llm/llm-requirements.txt", these two packages are only added to setup.py and not to the requirements file, creating an inconsistency between the two dependency specifications.
```python
raw_request = self._create_raw_request(request, "/audio/transcriptions")

# Extract audio data from the request file
audio_data = await request.file.read()
```
Bug: Audio File Reading and Pointer Issue
The code calls await request.file.read() to extract audio data from the request file. However, this reads the entire file into memory, which could be problematic for large audio files. Additionally, after reading the file once, the file pointer is at the end, so if vLLM's create_transcription method tries to read from request.file again, it will get empty data. The audio_data is extracted but the original request.file is still passed to create_transcription, which may cause issues if vLLM expects an unread file object.
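A hedged sketch of one way around the stale pointer, assuming request.file is a Starlette UploadFile (this is not the actual engine code):

```python
from starlette.datastructures import UploadFile


async def read_audio_once(upload: UploadFile) -> bytes:
    """Read the uploaded audio, then rewind so later consumers can read it again."""
    audio_data = await upload.read()
    # Reset the pointer in case create_transcription re-reads the same file object.
    await upload.seek(0)
    return audio_data
```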
| "not set it, a random_uuid will be generated. This id is used " | ||
| "through out the inference process and return in response." | ||
| ), | ||
| ) |
Bug: UUID Inconsistency Causes vLLM Compatibility Issues
The TranscriptionRequest adds a request_id with a default UUID, which is inconsistent with other request types that mark this field for upstreaming to vLLM. This approach, questioned in a prior discussion, may cause compatibility issues with vLLM's backend.
```diff
-        Union[CompletionStreamResponse, CompletionResponse, ErrorResponse], None
+        Union[str, CompletionStreamResponse, CompletionResponse, ErrorResponse], None
     ],
 ]
```
```python
# vLLM implementation for handling transcription requests: https://github.com/vllm-project/vllm/blob/0825197bee8dea547f2ab25f48afd8aea0cd2578/vllm/entrypoints/openai/api_server.py#L839.
async def transcriptions(
    self, body: Annotated[TranscriptionRequest, Form()]
) -> Response:
```
Bug: Audio File Serialization Issue in Ray Serve
The transcriptions endpoint uses Annotated[TranscriptionRequest, Form()] to handle form data, but the TranscriptionRequest object contains a file field that is an UploadFile. When this is passed through Ray Serve's remote call mechanism (in _get_response), the UploadFile object may not be serializable for pickling, which is required for Ray remote calls. This could cause serialization errors when the request is sent to the model deployment. The audio data should be extracted and passed as bytes before the remote call, similar to what's done in the vLLM engine at line 473.
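A rough sketch of the suggested shape, extracting the bytes before the remote call so only picklable data crosses the Ray boundary (the field and helper names are assumptions, not the actual router code):

```python
from typing import Any, Dict

from starlette.datastructures import UploadFile


async def to_picklable_payload(request_model: Any) -> Dict[str, Any]:
    """Swap the UploadFile handle for raw bytes so Ray can pickle the request."""
    # Assumes a pydantic v2 model with an UploadFile `file` field.
    payload: Dict[str, Any] = request_model.model_dump(exclude={"file"})
    upload: UploadFile = request_model.file
    payload["audio_bytes"] = await upload.read()
    return payload
```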
My bad, fixed the issue @kouroshHakha
Release test: […] All other release tests passing.
Why are these changes needed?
Expose a transcriptions API like https://platform.openai.com/docs/api-reference/audio using vLLM.
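For context, a hedged example of how a client could exercise the endpoint once an app is deployed (the base URL, API key, and model id are assumptions for illustration):

```python
from openai import OpenAI

# Point the OpenAI client at the Ray Serve LLM deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-tiny",  # assumed model_id from the deployment config
        file=audio_file,
    )

print(transcription.text)
```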
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.